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Abstract: A wealth of epidemiological data suggests an association be- 
tween mortality/morbidity from pulmonary and cardiovascular adverse events 
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associations although the abundance of the data. In this paper we describe 
an SSA (Singular Spectrum Analysis) based approach in order to decom- 
pose the time-series of particulate matter concentration into a set of expo- 
sure variables, each one representing a different timescale. We implement 
our methodology to investigate both acute and long-term effects of PMio 
exposure on morbidity from respiratory causes within the urban area of 
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1. Introduction 

A wealth of epidemiological studies based on time-series analysis has shown ev- 
idence for association between morbidity/mortality caused by respiratory and 
cardiovascular adverse events and the exposure to airborne particles [3]. Sus- 
pended total particulate matter is a complex mixture of organic and inorganic 
substances in either liquid or solid phase. They can vary in size, composition 
and origin and can be characterized both physically and chemically. Particles 
with an aerodynamic diameter of less than 10 microns are referred to as PMio: 
they may be inhaled reaching the upper airways and the lungs, with risk for 
adverse effects on health. 

Assuming counts data y t of daily adverse health event being distributed as 
conditionally independent Poisson given the rate ip t , a standard ecological Pois- 
son regression model (which has been used, with minor variations, in most of 
large scale epidemiological studies [11]) is 

log ((pt) = /? +/3iPM 10 , t + [DOW t ]+S (t, 5i)+S (tem Pu S 2 )+S {umr t , S 3 ) (1.1) 

where PMio.t is measured in /xg/m 3 , [DOWt] is a six-dimensional vector of 
dummy variables pointing the day of the week, and iS (i, Si) is a smooth term 
function of calendar time controlling for seasonality and other trends (the degree 
of roughness being controlled by the smoothing parameter Si); further smooth 
confoundcrs include temperature (temp) measured in °C and relative humid- 
ity (umr) expressed as percentage (meteorological confoundcrs may affect the 
pollution-morbidity association: see [2G] for an interesting discussion). 

Among the most recent results we may cite the MIS A 2, a planned study 
over 15 Italian cities for the period 1996-2002 [4]. Updated city-specific esti- 
mates show an overall RR=1.005 (C.I.: 0.991-1.018 - estimate ±2 std. err.) per 
10 /zg/m 3 increase in PMio concentration for respiratory causes, with a simi- 
lar RR=1.005 (C.I: 1.000-1.010) for cardiovascular causes. The estimated lag-3 
days overall RR of hospitalization due to respiratory causes is 1.006 (C.I.: 1.002- 
1.011), while RR=1.003 (C.I.: 1.000-1.006) is estimated for cardiovascular ad- 
verse events. Another important and recent study is the European meta-analysis 
conducted by the Regional Office of World Health Organization (WHO) for Eu- 
rope, based on 17 country-specific estimates [1]; the mortality RRs reported in 
the WHO-mcta analysis are 1.009 (C.I.: 1.005-1.013) for cardiovascular causes 
and 1.013 (C.I.: 1.005-1.020) for respiratory ones (per 10 /itg/m 3 increase in 
PMio). 

Despite this growing body of evidence, a considerable uncertainty remains to 
be seen: this begs the question of whether these associations represent premature 
mortality within only few days among those already near to death. Such a dis- 
placement (or harvesting) effect has been discussed by several authors after [19], 
and can complicate the interpretation of the results: a reasonable underlying hy- 
pothesis is that mortality/morbidity displacement is related to associations on 
shorter time scales, while longer time scales are supposed to be resistant to 
mortality displacement. If associations reflect only harvesting, from a public- 
health point of view, the effect of air pollution on morbidity can be considered 
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as having a limited impact. A first attempt to assess short and long-term ef- 
fects was proposed in [28] by using Cleveland STL decomposition by means of 
LOESS smoothing algorithm to separate the time-series of daily deaths into 
long, intermediate and short (residual) timescale series. Similarly, [10] devel- 
oped a methodology based on the Discrete Fourier Transform to obtain a set 
of orthogonal predictors at given timescales, by partitioning the base interval 
[0, 7r] into a given set of Fourier frequencies. Expanding time series of particu- 
late matter concentration into a set of lagged exposure variables is a concurrent 
approach: a distributcd-lag model was proposed in [36] by replacing the pollu- 
tant effect with a distributed lag-specification, each lag coefficient representing 
a specific contribution to exposure: cumulative effects on a given timescale can 
be obtained by summing up contributions for a given range of lags. 

The aforementioned methods share a common drawback: suitable timescales 
need to be arranged by the researcher in advance, rather than being a natural 
result of the data analysis process. For example, [28] examines mid-scale compo- 
nents of the daily number of deaths with smoothing windows of 15, 30, 45 and 60 
days: risk estimates are provided for each mid-scale window without attempting 
to provide a data based criterion to choose among diverse alternatives. Simi- 
larly, [10] estimate the association between air pollution and mortality using six 
fixed timescale: > 60 days, 30-59 days, 14-29 days, 7-13 days, 3.5-6 days and 
< 3.5 days. In a word, current approaches to mortality displacement estimation 
do not provide automatic, data-driven methods to decompose air pollution time 
series into a set of suitable exposure variables, each one representing a differ- 
ent timescale. Improvements in this field would be greatly beneficial in public 
health time series studies. For this reason, we propose an alternative approach 
based on Singular Spectrum Analysis - SSA [14]. The word "spectrum" may 
be quite confusing here, since SSA is not derived from Fourier analysis, but it 
is an algorithmic technique rooted in dynamical system theory, linear algebra 
and multivariate geometry. SSA can be defined as a model-free approach to de- 
compose a time series in easy-to-intcrprct components such as trend, harmonic 
intermediate components and pure noise (short scale residual) . This task can be 
accomplished by exploiting a functional clustering algorithm based on a suit- 
able metrics that allows a sensible grouping of more "elementary" components. 
No fixed timescales need to be known in advance in our novel approach: the 
proposed methodology is used to test the harvesting hypothesis on a dataset of 
residents in the city of Bari (Apulia, Italy). 

The paper is structured as follows. Sec. 2 briefly reviews the data and the pre- 
processing methods used to deal with missing information and outliers; Sec. 3 
gives a short introduction to SSA; the first part of Sec. 4 develops a functional 
clustering algorithm to group elementary components into interprctablc expo- 
sure variables at several timescales; the second part of Sec. 4 describes timescale 
estimation by means of GAM models with integrated smoothing parameter se- 
lection, we reported also some results about our data set. Finally, Sec. 5 gives 
a brief discussion about the results and outlines future research opportunities. 
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2. Data description and pre-processing 

Epidemiological data were obtained from Apulian Regional Epidemiological 
Center, concerning the daily time-series of hospitalized people among residents 
in the city of Bari between June 1th, 2000 and December 31th, 2001 (in total 
N = 579 days), diagnosed as suffering from pulmonary diseases (ICD-IX Clas- 
sification: 460-519). Time series of particulate matter concentration and meteo- 
rological data were collected by a monitoring network subgroup of the Munici- 
pality of Bari (Department of Environmental Protection and Health), including 
four monitoring stations named "S. Nicola", "King", "Savoia", "Cavour" that 
collected information concerning a wide set of pollutants, such as Benzene, CO, 
NO, NO2, NOz, O3, and SO2. It is worth noticing that most time-series present 
a large number of missing data points. 

Bi-hourly measurements of PM10 were available on each of the four monitor- 
ing stations, whereas temperature and relative humidity were available on an 
hourly basis on "S. Nicola", "Savoia" and "Cavour" stations only. These data 
have been pre-processed in order to calculate an overall daily series for each 
variable (from midnight to midnight the day after). Preliminary data analysis 
concerned of the adjustment for most disturbing outliers attributable to tem- 
porary failures in monitoring devices. In fact, we computed a robust estimate 
of the covariancc matrix of each multivariate time-series by using the Minimum 
Covariance Determinant (MCD) estimator (see [25] for details: the robust MCD 
estimator requires much less data for reliable results than the classical covari- 
ance matrix estimator, and gives more interpretable results as extreme values 
are well isolated.). We set an empirical rule by removing the five multivariate ob- 
servations that showed the highest Mahalanobis distance (from the barycentre) 
based on MCD estimate. Fig. 1 shows some details for the PM10 series : most 
of the bi-hourly data are considered as to be missing, as the MCD estimation 
algorithm removes forcibly all the rows containing at least one missing datum. 

For each monitoring station we recovered daily measures by averaging the 
twenty-four hourly observations (the mean of the twelve two-hours observations 
in the case of PM10) when at least 75% of one- hour observations were available 
(this criterion is compliant with APHEA protocol, [2]); otherwise, the daily 
datum was considered missing. After averaging over time, there were still a lot 
of missing values in the PM10 series, as it is shown in the right panel of Fig. 2. 
The left panel contains the daily time series of hospitalizations. 

Some exploratory statistics obtained after daily averaging arc reported in 
Tab. 1: it is readily apparent that missing data in the PM10 series is about 60% 
in the "King" station. Comparable values were observed for the other monitoring 
stations; temperature and relative humidity were far less difficult to analyze. In 
order to obtain an overall daily measure for each variable, we computed a spatial 
average of each daily time-series. This synthesis of information is mainly based 
on the assumption of constant exposure over the whole urban area. This is a 
reasonable assumption for temperature and relative humidity, whereas for the 
daily PM10 series this can be justified on the ground that correlation coefficients 
between the measurements ranged from 0.66 to 0.87. 
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Fig 1 . LEFT - Classic squared Mahalanobis distance for each multivariate observation of the 
PMio hourly series. RIGHT - Robust Square Mahalanobis distance calculated through the 
output of the MCD algorithm (Minimum Covariance Determinant Estimator). 
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Fig 2. LEFT - Time-series of hospitalized people between June 1th, 2000 and December 
31th, 2001, diagnosed as suffering from pulmonary diseases (ICD-IX Classification: £60-519). 
RIGHT - Daily time series of PMiq collected at the stations "Cavour" , "Savoia", "King" and 
"San Nicola". 



Some synthetic values of the reconstructed series can be found in Tab. 2; it 
is worth noticing the dramatic reduction in number of the missing points in the 
PMio series. There were still a few missing values in correspondence of those 
days for which the four PMio measurements were not available. In this case, 15- 
days causal moving averages was used for imputing missing values. The ultimate 
result of this information filtering and retrieval work is shown in Fig. 3. 
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Table 1 

Some exploratory statistics for PMio, temperature (temp) and relative humidity daily (umr) 

time-series 



Variables 


Days 


Missing 


% Missing 


Min 


Median 


Max 


PMio 














S. Nicola 


579.00 


304.00 


52.5 


8.195 


35.431 


94.083 


King 


579.00 


350.00 


60.45 


9.354 


23.345 


54.637 


Savoia 


579.00 


270.00 


46.63 


21.945 


59.769 


226.767 


Cavour 


579.00 


321.00 


55.44 


6.954 


47.186 


280.388 


temp 














S. Nicola 


579.00 


86.00 


14.8 


2.065 


19.481 


34.565 


Savoia 


579.00 


15.00 


2.6 


0.159 


19.356 


35.411 


Cavour 


579.00 


20.00 


3.45 


2.537 


20.549 


34.546 


umr 














S. Nicola 


579.00 


114.00 


19.68 


19.955 


64.844 


97.752 


Savoia 


579.00 


15.00 


2.6 


30.456 


67.672 


95.46 


Cavour 


579.00 


27.00 


4.66 


17.22 


66.588 


80.264 



Table 2 

Some exploratory statistics for the spatially aggregated PMio, temperature (temp) and 
relative humidity daily (umr) time series 



Variables 


Days 


Missing 


% Missing 


Min 


Median 


Max 


PMio 


579.00 


12.00 


2.07 


13.907 


40.46 


253.58 


temp 


579.00 


0.00 





1.587 


19.202 


34.841 


umr 


579.00 


0.00 





27 


67.2 


94.25 




Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
200D 2001 



FlG 3. Reconstructed PMio, temperature and relative humidity time-series. 
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3. Singular Spectrum Analysis 



In this section a brief review of Singular Spectrum Analysis (SSA) is provided: 
detailed description can be found in [13; 14]. We denote the daily PMio,* time- 
series by {%t}t=o 1: SSA relies on the Karhunen-Loeve decomposition of a covari- 
ance matrix estimate of K lagged copies of the original time series [29; 34; 35]. 

Let L < [N/2] be a fixed integer called window length and introduce K = 
N — L + l lagged vectors (xt-i, Xi-2, ■ ■ ■ , £i+L-2) T , for i = 1, . . . , k: the first step 
consists of embedding the original time series {xt}^^ 1 into a lower-dimensional 
manifold than the original phase space, by transforming it into the following 
L-trajectory matrix 



( 



X = 



x\ 

X2 



X\ 
X2 
■I';', 



\ X L ~\ X L 



XK-1 \ 

X K 
XK+1 

XN-l j 



Some celebrated results in dynamic system theory (a good reference is [5]) en- 
sure that all the qualitative (topological) characteristics of the phase space are 
preserved after such a dimensionality reduction process. It is worth noticing that 
every L-trajectory matrix is Hankel, i.e. its entries coincide along the secondary 
matrix diagonals such that i + j • — s for 2 < s < L + K, and vice versa every 
L x K Hankel matrix is the L-trajectory matrix of some time series. 

In the second stage, further information collapsing is carried out by comput- 
ing the eigenvalues \ of the L x L matrix S = XX T . Let d = rank (A) = 
rank (AA T ) be the number of nonzero eigenvalues of the matrix S: if d < L 
then Ai > A2 > . . . > Ad > and A^+i = for all other eigenvalues with indexes 
larger than d. The trajectory matrix can be decomposed into its Singular Value 
Decomposition (SVD, [12]) 



X = Ai 



Xi 



A, 



(3.1) 



where A.; = y/Xiuivf , yAi > • ■ • > vKi > are the singular values of S, 
Ui E R L (i = 1, . . . ,d) is a system of orthonormal eigenvectors associated to 
nonzero eigenvalues of A and Vi = X T Uij \J\i. Hence, the trajectory matrix is 
decomposed into a sum of elementary rank-one, pairwise bi-orthogonal matri- 
ces. The collection Ui, will be referred to as the ith eigentriple of the 
SVD (3.1). 

In the third phase, SSA attempts to reconstruct components {xet} 1 ^ 1 such 
that the original time series can be decomposed into the sum of p time-series 



Xi 



^2x et , t = 0, 



,N-1 



(3.2) 



e=i 



which can have meaningful interpretations. Reconstruction of such components 
requires a suitable grouping of the set of indexes I = {1, . . . , d} into p disjoint 
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subsets I\ , . . . , I i , 
can be reformulated as 



I p with Ig 



, }, such that the SVD expansion 



X = X T 



x, 



Xr 



(3.3) 



where Xi e = Xi 1 +. . -+Xi n . If all the matrices on the right hand-side of (3.3) are 

Hankel, then they are L- trajectory matrices from which components {xit}^^ 
on different timescales can be easily reconstructed. 

Nevertheless, this situation rarely occurs in practice: in most real data sets 
no sensible grouping can be found such that X] 1 , . . . , Xj are L-trajectory ma- 
trices. Last SSA phase consists of Hankelization or diagonal averaging of re- 
sultant matrices Xj ± , . . . , Xr . Let Y be any L x K matrix with elements , 

L* = min (L, if), K* = m&x(L, K), and ||^||^ = J2i=iJ2f=iVij the squared 
Frobenius-Perron norm of the matrix Y. Hankelization is carried out by means 
of a linear operator Ti acting on the space of L x K matrices: the resulting 



matrix Z = TLY with elements is defined as follows (,s 



J) 



i=i 



1 

= i — E-'C- .. 

3=1 



for 2 < s < L* 



for L* + 1 < s < K* + 1 



TTo E Vl'-i tovK*+2<s<N + l 



N -s + 2 



(3.4) 



where = if L < K and y*j = y^j otherwise. It can be proved that Z is 
Hankel and thus it is the L-trajectory matrix of some time series. It is also the 
best approximation to Y in the sense of Frobenius-Perron norm [6]: if M.ly.k is 
the space of real L x K matrices, and M. L J K is the linear subspacc of Hankel 
L x K matrices, then ||V — Z\\ 2 is minimum, so that it readily follows that 
TL : M.L/K — ► -M-lxk i s an orthogonal projector operator and H.X = X. By 
applying diagonal averaging to both members of (3.3) every resultant matrix 
Xjj , . . • , Xt produces an L-trajectory matrix, from which the decomposition 
(3.2) can be easily recovered. 



4. Eigentriple grouping 

4- 1. Reconstruction of components 

We can consider the lag-covariance matrix C = S/K instead of S for obtaining 
the singular values. C is a sort of non-centered covariance matrix among columns 
of X (the L-lagged vectors), and its use is justified by the fact that from Cui = 
Xfui it follows that Sui = (K\f) u,. The latter expression shows that the 
orthonormal system {u{\ is unaffected by the choice of C, and the only difference 




Fig 4. LEFT - Singular values for eigentriples (1-60). RIGHT - Eigenvalues shares for 
eigentriples (1-60). For both graphics a semi-logarithmic scale on the Y axis has been used. 



in the SVD of X lies in the magnitude of the corresponding eigenvalues (they 
are K times larger for S). This fact simplifies the visual mining of component 
series. We set L = 60 for the window length, as adverse effects at timescales 
longer than two months arc likely to be confounded with long-term effects due 
to other causes. 

The left panel of Fig. 4 shows spectrum of each singular value. Singular 
values are very close to each other, except for a large dominant value which 
prompts for a long-period slow- varying component (trend), and several plateaux 
that correspond to shorter period oscillatory components or pure noise. The 
right panel shows the degree of approximation of each single component of the 
SVD of X: it can be proved that = Y7i=i A» an d that X — X Ie for 

h = {h,..., £ n ,} C {1, . . . , d} is the SVD of X h with I 3 = {1, . . . , d} /I t , hence 
a measure of the degree of approximation referred to as eigenvalues share of the 
eigentriples with indexes 1/ = {li , . . . , £ ne } can be defined in the following way 

\\X-X Il \\l l _ EjU Aj - E Jg{ i,...,d}// t A, _ A fl + . . . + 
\\X\? M Ai + ... + A«i Ai + ... + Ad 

The largest eigenvalue (\/Ai = 357.85) accounts for about 88% of the total 
variability: this fact strengthens the belief that the corresponding eigentriple can 
be identified with slowly- varying component (trend), whereas the individuation 
of further components demands for a more elaborate algorithm. 

A suitable decomposition of the PMio series can be determined by modifying 
the four-step algorithm described in the previous section. We suggest to apply 
Hankelization to each term in the full SVD decomposition (3.1): if X^ = HXi 
then 

X = X[ H) + . . . + X!> H) + . . . + X™ (4.1) 

with p = d = L = 60, being S = XX T non singular. In general, elementary han- 
kelized matrices on the right hand side of (4.1) are not pairwise bi-orthogonal 
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(row and column orthogonal) matrices, hence the sum of two of such matri- 
ces does not need to be Hankel. It can be easily proved that Xg and Xj 
arc bi-orthogonal if and only if (X^\Xj )m = 0, where (xj , Xj^^m = 
J2i=i x ij IS the inner-product compatible with the Frobenius-Perron norm. 

Additionally (see [14], page. 258), it can be proved that since {X g , X) )m = 

(n-/\ (7-{\ 

0, Xg + Xj is an Hankel matrix. Hence it is the L-trajectory matrix of 
some component time series: this condition will be referred to as weak L-sepa- 
r ability. 

By joining elementary hankelized components having minimum distance in 
terms of weak L-separability, we often obtain a sensible grouping (3.3) whose 
component matrices are as close as possible (in the sense of the Frobenius-Perron 
distance) to Hankel matrices, hence amenable to a suitable interpretation after 
the diagonal averaging step. A sensible measure of weak L-separability between 
components I and j is the w- correlation 

(M) _ \ A l i A J )M 

w " 3 ~ wx^WmWx^Wm 

where £,j = 1, . . . ,L and \\X™\\ M = «xf \ xf ] ) M )^ 

If the absolute value of the ^-correlations is small then the corresponding 
series are almost w-orthogonal, but if is large then the two series are 

far from being w-orthogonal and therefore badly separable. If the matrix W = 
{\ w \ J j^ > W nas a quasi-block diagonal structure, eigentriple grouping can be done 
accordingly by joining elementary components belonging to the same block. 
Unfortunately, real datasets usually have an entangled structure, except for 
some theoretical examples of little importance: Fig. 5 shows a 100-gray level 
representation of the W matrix for the PMio series. Darkest gray on the main 
diagonal corresponds to |u>L | = 1: a sharp block structure apparently stands 
for groups {1} and {2 — 6} only. 

To overcome this difficulty we define the matrix 1 — W = {1 — |ui^^|}. 
From a formal point of view 1 — W is a dissimilarity matrix: it is symmetric 
and off-diagonal elements arc strictly positive. Similar components have small 
iw-correlations (in absolute value): equivalently, they have large values in the cor- 
responding entries of the dissimilarity matrix 1 — W. Therefore, it is natural to 
group elementary components by clustering hierarchically the elementary han- 
kelized series taking 1 — W as the distance matrix. Consequently, the complete 
linkage method seems to be the best choice. 

Results are shown in the right panel of Fig. 5: almost p = 5 groups are 
apparent (the choice of p = 3 and p = 4 is not compatible with the dendrogram 
branching). According to the clustering output, the following decomposition into 
p = 5 groups can be done: Gi = {1 — 6}, Q2 = {7 — 23}, G3 = {24 — 33, 35, 36}, 
Gi = {34, 37 — 44}, G5 = {45 — 60}. Elementary components in (3.1) are grouped 
accordingly, and transformed into Hankel matrices to obtain the L-trajectory 
matrices of the corresponding reconstructed series. 
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Full reconstruction (p=60) 



Fig 5. LEFT - 100 gray-level representation of W matrix for hankelized elementary com- 
ponents of the SVD of PMio time-series. RIGHT - Hierarchical clustering of elementary 
hankelized series with 1 — W as dissimilarity matrix. Complete linkage method is used: p=5 
groups are highlighted. 

Periods of the reconstructed components are not fixed in advance: there- 
fore, the periodogram analysis of each reconstructed component may help us to 
estimate the corresponding timcscales. More sophisticated approaches include 
non-parametric or parametric ARMA-based spectral density estimation [7; 8], 
which are both based on a suitable smoothing of the Fast Fourier Transform out- 
put, and require considerable skill for a proper application. For this reason, we 
suggest a faster and heuristic approach to period estimation. Period is defined as 
the amount of time necessary to complete a cycle: hence the number of turning 
peaks (local maxima) observed during the N = 579 days will be a rough esti- 
mate of the frequency (the number of cycles in the unit time). According to our 
definition, an approximate period estimate is given by the inverse of this quantity 

Number of days in the observation period 

II \ Qi) = : 

Number of peaks in the iih reconstructed component 

Exploiting this simple estimator we found Tl{Gi) — 27.57, II (£2) — 7.24, 
n(£/ 3 ) ~ 3.97, n(C? 4 ) ~ 2.84, II(<? 5 ) ~ 2.46: "day" is the natural measurement 
unit. 

It is not surprising that some redundancies may arise: for example, estimated 
timescales II (£4) and II (£5 ) are almost identical. This result can be explained 
by noting that there does not exist a simple one-by-one correspondence between 
weak L-separability and the frequency content of the elementary series which 
are joined together by our functional clustering algorithm. For this reason, if 
[n((?^)] = [Il(Cfj)] we shall say that reconstructed components t and j are 
non-identifiable (by [■] we denote the integer part function): in this case we pre- 
scribe that the corresponding elementary scries should be merged into a single 
group, and a new component should be reconstructed by diagonal averaging. 
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This criterion seems to be reasonable for the application at hand, but it may 
be inappropriate in other contexts. For example, an orthogonal design is based 
on a set of orthogonal exposure variables: unfortunately w-orthogonality does 
not imply classical (Euclidean) orthogonality, with the result that reconstructed 
components are not pairwise orthogonal. In fact only elementary components 
in the right-hand side of (3.1) have this property, being the principal compo- 
nents of the L-trajectory matrix. In this case, a suitable additional constraint 
to remove non-idcntifiability would be \R.g e ,g j \ < e, which consists in merging 
reconstructed series I and j for which the Pearson correlation coefficient Rg e ,g j 
is below a predetermined tolerance e. 

In our case components Q4 and Q5 arc non-identifiable according to our defi- 
nition and Rg^g 5 — 0.19. Therefore, a new grouping into p = 4 components can 
be obtained by merging elementary series formerly classified into groups Q4 and 
Qb'- Qnew.i = Qi, Qnew.i = Q2, Qnew.3 = Q3 and Gnew.4 = {34,37-60}. The new 
reconstructed components are shown in Fig. 6: looking at the first two panels, a 
remarkable overfitting of the longest period waveform is apparent. We did not 
explore this feature, as prediction or change-point detection are not the objec- 
tives of our analysis. We propose the following interpretation of the estimated 
series: 

• Qnew.i'- eigenvalues {1 — 6} are grouped. Given that H(Q n ew.i) — 27.57, 
this variable measures particulate matter exposure at a timescale of about 
four weeks; 

• Qnew.i'- eigenvalues {7— 23} arc grouped. This scries can be considered as 
a proxy for exposures occurred in the past week, as II (Gnew.2) — 7.24; 

• Qnew.3- eigenvalues {24 — 33,35,36} are grouped. The reconstructed se- 
ries is a short-period predictor, measuring lagged exposure at a five-day 
timescale: Tl(Q ne w.s) — 3.97; 

• Qnew.i'- eigenvalues {34, 37 — 60} are grouped. Very short periods of about 
three days or less are mixed up, as II (Qnew.i) — 2.63. 

Of course, a main feature of SSA is that it encompasses an automated data- 
driven approach for decomposing the time series into timescale components: 
reconstructed short-period series can be used for testing the mortality displace- 
ment hypothesis. In addition, long-period air pollution effects can be estimated 
in correspondence of exposure variables that can be easily interpretable, and 
that vary at timescales of scientific interest (such as seasonality and trend). We 
address this issue in the next section. 

4-2. Timescale estimation 

A GAM model accounting for the exposure variables at given timescales can be 
formulated as 

p 

\og(ip t ) = h t +Y^Ptxu (4.2) 
1=1 
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Fig 6. Original PMio and p = 4 reconstructed components according to the following group- 
ing: Gnew.l = {1-6}, g new .2 = {7-23}, g„ ew . 3 = {24-33,35,36}, g„ ewA = {34,37-60}. 

where h t is a shortcut for the intercept, the dummy vector [DOW*] and the 
smooth components, while {xu}^^ is a suitable set of exposure variables cho- 
sen to account for timescalc effects. Our functional clustering method can be 
considered as a dimensionality reduction algorithm that allows for parametriza- 
tion of model (4.2) in term of a small number of waveforms xu] other meth- 
ods, such as standard principal components analysis (PCA) and independent 
component analysis (ICA, [20]) assume mathematically convenient constraints 
(orthogonality and independence) but have no meaningful justification as they 
cannot decompose the original data into a set of exposure variables and com- 
plicate the interpretation of components. 

The initial SSA decomposition can be modified on the ground of a careful 
consideration of both estimated timcscales and singular value amplitudes (see 
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Fig 7. Original PMiq series and estimated trend on the basis of the new grouping strategy: 



(2) 



Fig. 4). For the first component we have H(Gnew.i) — 27.57, a timescale which 
is relatively shorter than the window length (L = 60). Intuitively, the first group 
includes too many elementary components, with the result that lower frequency 
harmonics have been smoothed away. It should be noted that a single leading 
singular value is apparent from Fig. 4: therefore, a new decomposition can be 
obtained by splitting group 1 into the two sub-groups {1} and {2 — 6}. The 

(2) (2) (2) 

new decomposition is given by: G^v/.i = {1}, = i 2 ~ 6 I> ^new.3 = £new.2, 

Snel.4 = Sncw.3, ^.5 = Sncw.4- We have II^.l) = 14 4-75 and II(0<^. a ) ~ 

27.57 days: Fig. 7 shows the new grouping slow-varying component (trend) and 
original series. 

Of course, other answers may be sensible: a trend plus season solution can 

(2) 

be estimated by widening Q n( ! w l in a suitable way. The estimated timescales 
of elementary components entering group G^J W 2 = {2 — 6} are IT (2) ~ 64.33, 
n (3) = 48.25, Id (4) ~ 32.17, fl (5) ~ 26.32, n (6) ~ 22.26. Therefore, it is logical 
to join the first and second leading singular values to obtain the following new 
decomposition: Q^J w l = {1 - 2}, G^J w 2 = {3 - 6}, G^J w 3 = G nQW . 2 , GnL.i = 

Sncw.3, ^l.s = £„cw.4, with n(^.i) - 72.38 and n(eiel. 2 ) - 2 7-57. 

Although the overall number of suitable alternative decompositions is not 
large, a subjective adjustment seems to be still required to obtain a correct 
and easy-to-interpret decomposition. A simple procedure for forward traversing 
the space of all sensible decomposition can rely on estimating the statistical 
goodness of fit of each set of exposure variables entering the GAM model (4.2), 
after adjusting for the increasing complexity by means of a suitable penalty: 
data-driven model choice is certainly appealing since the interest is focused on 
the relationship between air pollution and morbidity. A computationally feasible 
solution is the UBRE criterion - a rescaled version of the AIC statistics, sec 
[16], pg. 160 - based on the minimization of the expected mean squared error 
(details are discussed in [32] and [33]): for n independent observations from an 
exponential family with scale parameter <j> (in the Poisson case </>=!) the UBRE 
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has the following expression 



1 



77 



2tr (R) <f) 



UBRE 



n 



n 



where Yle=i D {ye', P-e) is the log-likelihood ratio statistics (Deviance) for the 
current model, R is the weighted linear operator that produces estimates of the 
adjusted dependent variable at each step of the GAM local scoring estimation 
algorithm, and tr {R) represents the overall effective degrees of freedom of the 
model (see [32] for details). It can be proved that UBRE is a simplified version 
of a generalized cross-validation (GCV) criterion that is valid when the scale 
parameter is not known; in addition, it is very similar to the GCV-PMio criterion 
introduced in [22]. 

Another benefit of automatic model selection concerns the choice of degrees 
of freedom (dfs), associated with smooth terms entering model (4.2) to adjust for 
temporal unmeasured confounders and meteorological variables. Each smooth 
term is a natural cubic spline, which can be derived from a conditional mini- 
mization problem with respect to coefficients of a suitable function basis, given 
a quadratic penalty depending on a symmetric smoothing matrix S {Sj) [17]: for 
fixed values of the smoothing parameters Sj , parameter estimation of the GAM 
model (4.2) is easily shown to be equivalent to a conditional minimization prob- 
lem with multiple quadratic penalties [33]. It is usually difficult to decide the 
amount of smoothing that is allowed for smooth functions in model (4.2): fixing 
each df to a pre-specified value may be sometimes justified on the ground of 
biological knowledge and of the problem peculiarities, although it is likely to 
invalidate the reproducibility of epidemiological findings [23]. For example [9] 
estimate a model very similar to (1.1) (except for the exclusion of the relative 
humidity term) by setting S\ = 7, 62 = 8: an identical model is estimated in 
[ ] by choosing Si = 7 x years -1 (i.e. Si = 3.5 for a two years observation pe- 
riod) and = 6. We know that data-driven model choice is a necessary task in 
particulate matters time-series studies, even if the number of candidate models 
may be prohibitive: when k smoothing terms are allowed into the model (as 
in our case), simultaneous fixed parameter estimation and smoothing param- 
eter selection based on UBRE minimization over a fc-dimensional space were 
computationally infeasible until the methods reported in [30; 31] were recently 
introduced. The related software, the R mgcv: :gam V. 1.3-29 library has been 
used for model estimation and selection throughout this section. 

4-3. Results and some comparisons 

To test the harvesting hypothesis we assumed that model (4.2) holds: evidence 
for short-term effects may reflect increased recruitment into the frail pool caused 
by air pollution (and not by a former disease condition): Table 3 shows effect 
estimates and approximate p- values for each predictor entering a model based on 
the Q n ew decomposition. The estimated degrees of freedom of each smooth term 
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Table 3 

Effect estimates, approximate significance of smooth terms and global model scores: the 
decomposition of PMiq used here is G new .i = {1 — 6}, G new .2 = {7 — 23}, 
Gnew.3 = {24 - 33, 35, 36}, G newA = {34, 37 - 60} 



Approximate significance of parameter estimates 





Estimate 


Std. Error 


z value 


Pr{> |z|) 


(Intercept) 


2.f8527I4 


0.0614917 


35.538 


<2e-16 


pmlO.l.new 


0.0002876 


0.0011013 


0.261 


0.794 


pml0.2.new 


-0.0012f47 


0.0012534 


-0.969 


0.332 


pml0.3.ncw 


0.00f8025 


0.0024178 


-0.746 


0.456 


pmI0.4.new 


0.0008980 


0.0019772 


0.454 


0.650 


Approximate 


significance of smooth terms 








edf 


Chi.sq 


p- value 




S(day) 


8.879 


186.168 


<2e-16 




S(temp) 


2.352 


10.407 


0.0645 




S(umr) 


1.976 


5.668 


0.2253 




Global scores 




R-sq.(adj) 


Dev. explained 


UBRE score 


n 




0.501 


51.6% 


0.36048 


579 



(edf) are inversely related to the smoothing parameter estimates: in order for 
a GAM to be identifiable each smooth evaluated at its covariate values should 
sum to zero ([16] refers to this as 'centering' the smoothing). This identifiabihty 
constraint removes one degree of freedom, thus the inferior limit for any smooth 
which is not curved at all (a straight line) is 1 (at variance with bounds on 
degrees of freedom for free smoothing splines, ranging between 2, a straight 
line, and +oo, a perfectly interpolating spline). 

Conditionally on Q new we found little evidence for mortality displacement, as 
well as there were no associations present on longer timescales. We investigated 
the sensitivity of our results with respect to the chosen decomposition: the 
model based on the Q new set provided comparable global scores (UBRE=0.3618 
and deviance explained equal to 51.6%). The only significant effect occurred 
at Gnew.i (adjusted RR=1.02 with C.L: 1.001-1.039), although a four-month 
association between morbidity and air pollution is quite unrealistic and it may 
be likely due to data-snooping. 

{ 31 

When Gnew was used, global model scores showed a significant decrease (ap- 
proximate Anova tests comparing the current model with the previous ones 
were both significant). Results are shown in Table 4: the smooth term control- 
ling for unmeasured temporal confounders was significant, as well as significant 
effects occurred at G n ew.i anc ^ G„ ew ,2 timescales. These results suggest to reject 

(3) 

the harvesting hypothesis: the negative effect estimated at the G n J w l timescale 
(corresponding to a relative-risk decrease) may be associated with a pool of 
healthy individuals which are still healthy two months after exposure to air 
pollution. The largest effect occurred at the timescale of one month (RR=1.00, 
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Table 4 

Effect estimates, approximate significance of smooth terms and global model scores: the 
decomposition of PM 10 used here is G^ w A = {1 - 2}, G„^ w 2 = i 3 ~ 6 )> Qnew.3 = i 7 ~ 23 i> 
GnL.4 = i 24 " 33, 35, 36}, 6^. 5 = {34, 37 - 60} 



Approximate significance of parameter estimates 





Estimate 


Std. Error 


z value 


Pr(> |*|) 


(Intercept) 


2.5895118 


0.1390330 


18.625 


<2c-16 


pml0.1.new(2) 


-0.0087046 


0.0029872 


-2.914 


0.00357 


pml0.2.new(2) 


0.0039257 


0.0016050 


2.446 


0.01445 


pml0.3.new(2) 


-0.0011658 


0.0012576 


-0.927 


0.35392 


pml0.4.new(2) 


0.0019135 


0.0024188 


0.791 


0.42889 


pm!0.5.new(2) 


0.0009902 


0.0019600 


0.505 


0.61341 


Approximate significance of smooth terms 




edf 


Chi.sq 


p- value 




S(day) 


8.936 


166.976 


<2e-16 




S(temp) 


3.006 


13.403 


0.0629 




S(umr) 


2.016 


6.807 


0.2354 




Global scores 




R-sq.(adj) 


Dev. explained 


UBRE score 


n 




0.508 


52.4% 


0.34438 


579 



C.I. =1.001-1. 007), which corresponds to a relative-risk increase of about 4% per 
10 /Ltg/m 3 increase of PMio concentration. These findings are quite comparable 
to those of [10], which reported (by means of their FFT-based decomposition 
method applied to both cardiovascular and respiratory causes in four US cities) 
larger effects at longer timescales from 14 days up to 2 months. 

We applied the Dominici's methodology to our PMio data for a direct com- 
parison: suitable R software (the decompose () function described in [10]) was 
used to generate two new sets of exposure variables. For the first one, the 
"breaks" in the decompose function have been set to 1, 19, 41, 83, 165, 579 days: 
except for the first and the last, these values are defined as 579/r where r = 
30, 14, 7,3.5. This choice was used as a standard in [10], and it generated quite 
equivalent waveforms to those obtained by the SSA Q n J w -b&sed decomposition. 
According to [10] the interpretation of such timescales is: more than 30 days 
(timescale 1, trend plus large scale periodicity, see Fig. 8), 14-30 days (timescale 
2) and so on down to 1-3.5 days (timescale 5). A second decomposition was gen- 
erated by setting breaks equal to 1, 21, 80, 146, 220, 579; except for the first and 
the last these were obtained by taking r = 27.57, 7.24, 3.97, 2.63. The three 
decompositions (in a word SSA, FFT-A and FFT-B) are shown side by side 
in Fig. 8: surprisingly, no significant effects occurred conditionally on FFT-A 
(see Table 5), an impractical result which was confirmed by the poor global 
model score performance (UBRE=0.3608). Identical results occurred for FFT- 
B (UBRE=0. 36048, deviance explained 51.7%, no significant effects): therefore, 
it must be stressed that SSA decomposition provides a better explanation of 
the data, the number of predictors entering the model being equal. 
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SSA-based decomposition Fourier decomposition (A) Fourier decomposition (B) 




time in days time in days time in days 



Fig 8. LEFT: SSA-based decomposition according to the following grouping: Gnew.i = {1~2}, 
Gnew.i = {3 — 6},Gnew.3 = Ql, GnewA = Gz, Gnew.s = Gi- MIDDLE: Fourier decomposition 
obtained by applying the R decomposeO function with breaks egual to 1, 19, 41, 83, 165, 579 days 
(except for the first and the last breaks, these were obtained as 579/r, where r = 30, 14, 7, 3.5). 
BIGHT: Fourier decomposition obtained by setting breaks egual to 1, 21, 80, 146, 220, 579 (ex- 

(3) 

cept for the first and the last breaks, these were defined as 579/n(5„ euJ j) for j = 2, 3, 4, 5 ). 

Most of the literature report little evidence for mortality displacement due to 
PMio: our findings reinforce the conclusion that the increased chronic morbidity 
risks associated with even small increase in PM 10 are well established, although 
these increases are not greater for susceptible populations. In concordance with 
our results, several authors report a lag of about four weeks between the pollu- 
tant exposure and an increase of the mortality/morbidity risk: for example, [21] 
reported significant effects on cardiovascular-respiratory mortality in Sydney, 
Australia, at longer timescales (one month or more). Similarly, [36] assumed a 
distributed lag model for mortality due to natural causes in Milan, Italy, in- 
cluding lags up to 45 days: the authors reported an estimated total suspended 
particulate relative risk RR=1.037 (C.I.: 1.019-1.056) for the first 15 days, close 
to zero for lags between 16 and 30, and RR=1.027 (C.I.: 1.019-1.56) for lags up 
to 45 days. 
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Table 5 

Effect estimates, approximate significance of smooth terms and global model scores: a 
Fourier decomposition obtained by applying the R decomposeO function was used, with 
breaks set equal to 1, 19, 41, 83, 165, 579 days 



Approximate significance of parameter estimates 





Estimate 


Std. Error 


z value 


Pr{> |z|) 


(Intercept) 


2.1959872 


0.0465239 


47.201 


<2e-16 


pml0.1.new(2) 


0.0003945 


0.0012542 


0.315 


0.753 


pml0.2.ncw(2) 


-0.0003446 


0.0016133 


-0.214 


0.831 


pml0.3.new(2) 


-0.0012409 


0.0015482 


-0.802 


0.423 


pml0.4.new(2) 


-0.0012339 


0.0015233 


-0.810 


0.418 


pml0.5.new(2) 


0.0026280 


0.0017240 


1.524 


0.127 


Approximate significance of smooth terms 




cdf 


Chi.sq 


p- value 




S(day) 


8.881 


187.811 


<2e-16 




S(tcmp) 


2.214 


9.840 


0.0799 




S(umr) 


1.888 


5.054 


0.2818 




Global scores 




R-sq.(adj) 


Dev. explained 


UBRE score 


n 




0.502 


51.7% 


0.3608 


579 



5. Discussion and conclusions 

A great deal of uncertainty exists about the extent of life-shortening or the 
increase in morbidity due to the effects of air pollution. A limited amount of 
results from particulate matter time series studies suggests that the increased 
morbidity/mortality risk is not greater for susceptible populations. If a mortal- 
ity/morbidity displacement (harvesting) effect is evident only a few days after 
exposure, then relevance of the findings of the daily time-series studies could be 
questioned, as adverse health effects might be arguably attributed only to the 
low quality of frail individuals at risk. 

To test and estimate the pattern of mortality displacement, we proposed a 
statistical framework based on Singular Spectrum Analysis (SSA), a geometric 
technique derived from dynamical system theory, suitable for constructing eas- 
ily interpretable exposure variables at several timescales. We believe that our 
method is superior from a practical point of view than FFT-based methods: the 
decomposition of the original scries can be seen as a part of a data-driven pro- 
cess, and Fourier frequencies do not need to be fixed in advance. The only free 
parameter is the window length L. The problem at hand can suggest a sensible 
value, for the reason that the main (and only) rule of thumb for stationary series 
containing multiple harmonic frequencies prescribes that periods larger than L 
are confused with long-term effects. 

Promising theoretical developments are reported in [15], where SSA is ex- 
tended to deal with missing data: the issue of missing information is ubiquitous 
in meteorologic and pollutant time-series, and the power of such newer data 
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pre-processing methods has still to be established. Some subjective adjustments 
are still needed during the grouping phase. In particular, a correct separation 
of the dominant period (trend) from sub-harmonic frequencies merged with 
the sub-dominant group is a critical phase: we believe that the data-adaptive 
wavelet-based method introduced in [27] is the correct route towards a fully 
automated data analysis in multi-scale public health time-series studies. More 
efficient functional clustering algorithms will be needed too: for example, the 
shape-based curve clustering procedure described in [18] is very promising. 
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